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f— ^ ^ The sophisticated and automated means of data collection used 

^^ . by an increasing number of institutions and companies leads to ex- 

^^ ' tremely large data sets. Subset selection in regression is essential 

0^ . when a huge number of covariates can potentially explain a response 

variable of interest. The recent statistical literature has seen an emer- 
gence of new selection methods that provide some type of compromise 
'^ ' . between implementation (computational speed) and statistical opti- 

^M ' mality (e.g., prediction error minimization). Global methods such as 

^^ ' Mallows' Cp have been supplanted by sequential methods such as 

stepwise regression. More recently, streamwise regression, faster than 
the former, has emerged. A recently proposed streamwise regression 
'•^1 , approach based on the variance inflation factor (VIF) is promising, 

but its least-squares based implementation makes it susceptible to 
the outliers inevitable in such large data sets. This lack of robustness 
^ . can lead to poor and suboptimal feature selection. In our case, we 

0^ ' seek to predict an individual's educational attainment using economic 

^ , and demographic variables. We show how classical VIF performs this 

task poorly and a robust procedure is necessary for policy makers. 
This article proposes a robust VIF regression, based on fast robust 
^^ . estimators, that inherits all the good properties of classical VIF in 

^^ ' the absence of outliers, but also continues to perform well in their 

presence where the classical approach fails. 
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1. Introduction. Data sets with millions of observations and a huge num- 
ber of variables are now quite common, especially in business- and finance- 
C^ , related fields, as well as computer sciences, health sciences, etc. An impor- 

tant challenge is to provide statistical tools and algorithms that can be used 
with such data sets. In particular, for regression models, a first data analysis 
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requires that the number of potential explanatory variables be reduced to 
a reasonable and tractable amount. Consider p potential explanatory vari- 
ables [1 xi---Xp]^ = x and a response variable y observed on n subjects. 
The classical normal linear model supposes y|x ~ A^(x-^/3;(T^) with slope 
parameters (3 = [/3o,/3i, . . . ,/3p] . The aim is to find a subset of explanatory 
variables that satisfies a given criterion and such that the regression model 
holds. 

The selection criteria are numerous and can be based on prediction, fit, 
etc. The available selection procedures can be broadly classified into three 
classes according to their general strategy and, as a result, their computa- 
tional speed. A first class considers all the possible combinations of covari- 
ates as potential models, evaluates each according to a fixed criterion, and 
chooses the model which best suits the selected criterion. A second class is 
formed of sequential selection procedures in which a covariate at a time is 
entered in (or removed from) the model, based on a criterion that can change 
from one step to the next and that is computed for all potential variables 
to enter (or to exit) until another criterion is reached. Finally, the third 
class of selection procedures is also sequential in nature, but each covariate 
is only considered once as a potential covariate. For the first class, we find 
criteria such as the AIC [Akaike (1973)], BIG [Schwarz (1978)], Mallows' Cp 
[Mallows (1973)], cross-validation, etc. [see also Efron (2004)]. These meth- 
ods are not adapted to large data sets since the number of potential models 
becomes too large and the computations are no longer feasible. For the sec- 
ond class, we find, for example, the classical stepwise regression which can 
be considered as a simple algorithm to compute the estimator of regres- 
sion coefficients (3 that minimizes an Iq penalized sum of squared errors 
]]y — X/3]]| + Ag]]/3]]ig, with q = {) and X = [1 Xj]j=i^...^p and 1 a vector of 
ones, that is, \\(3\\iq = Z]?=i^(/3j 7^0) [see Lin, Foster and Ungar (2011)], 
with r]{l3j 7^ 0) = 1 if /3j 7^ and otherwise. Fast algorithms for stepwise 
regressions are available, for example, Foster and Stine (2004). Procedures 
for the /i problem are also available, for example, Lasso/LARS [Efron et al. 
(2004)], the Dantzig Selector [Candes and Tao (2007)], or coordinate de- 
scent [Friedman, Hastie and Tibshirani (2010)]. But these algorithms may 
also become very slow for large data sets, not only because all remaining 
variables are evaluated at each stage, but also because the penalty \q needs 
to be computed, and often via cross-validation. The last class is a variation 
of stepwise regression in which covariates are tested sequentially but only 
once for addition to the model. An example is the streamwise regression of 
Zhou et al. (2006), which uses the a-investing rule [Foster and Stine (2008)], 
is very fast, and guards against overfitting. An improved streamwise regres- 
sion approach was recently proposed in Lin, Foster and Ungar (2011) where 
a very fast to compute test statistic based on the variance inflation factor 
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(VIF) of the candidate variable, given the currently selected model, is pro- 
posed. The approach takes into account possible multicollinearity, seeking 
to find the best predictive model, even if it is not the most parsimonious. 
Comparisons in Lin, Foster and Ungar (2011) establish that the method 
performs well and is the fastest available. 

Our concern in this paper is to provide model selection tools for the re- 
gression model that are robust to small model deviations. As argued in 
Dupuis and Victoria-Feser (2011) [see also Ronchetti and Staudte (1994)], 
spurious model deviations such as outliers can lead to a completely differ- 
ent, and suboptimal, selected model when a nonrobust criterion, like Mal- 
lows' Cp or the VIF, is used. This happens because under slight data con- 
tamination, the estimated model parameters, using, for example, the least 
squares estimator (LS) and, consequently, the model choice criterion, can 
be seriously biased. The consequence is that when the estimated criteria 
are compared to an absolute level (like a quantile of the y^ distribution), 
the decisions are taken at the wrong level. For the first class of selection 
procedures, robust criteria have been proposed such as the robust AIC of 
Ronchetti (1982), the robust BIC of Machado (1993), the robust Mallows' 
Cp of Ronchetti and Staudte (1994), and a robust criterion based on cross- 
validation (CV) in Ronchetti, Field and Blanchard (1997). Since standard 
robust estimators are impossible to compute when the number of covariates 
is too large, Dupuis and Victoria-Feser (2011) proposed the use of a forward 
search procedure together with adjusted robust estimators when there is a 
large number of potential covariates. Their selection procedure, called Fast 
Robust Forward Selection (FRFS), falls in the second class of selection pro- 
cedures. FRFS outperforms classical approaches such as Lasso/LARS when 
data contamination is present and outperforms, in all studied instances, a 
robust version of the LARS algorithm proposed by Khan, Van Aelst and 
Zamar (2007). 

However, although FRFS is indeed very fast and robust, it too can become 
quite slow when the number of potential covariates is very large, as all 
covariates are reconsidered after one is selected for entry in the model. It is 
therefore important to have a robust selection procedure in the streamwise 
regression class so that very large data sets can be analyzed in a robust 
fashion. In this paper we develop a robust VIF approach that is fast, very 
efficient, and clearly outperforms nonrobust VIF in the presence of outliers. 

The remainder of the paper is organized as follows. In Section 2 we review 
the classical VIF approach and present our robust VIF approach. A simu- 
lation study in Section 3 shows the good performance of the new approach. 
In Section 4 we analyze educational attainment data and show how policy 
makers are better served by robust VIF regression than by classical VIF or 
Lasso. In Section 5 we present a shorter analysis of a large crime data set 
that highlights more problems with classical VIF for real data. Section 6 
contains a few closing remarks. 
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2. Robust VIF regression. 

2.1. The classical approach. Lin, Foster and Ungar (2011) propose a pro- 
cedure that allows one to sweep through all available covariates and to enter 
those that can reduce a statistically sufficient part of the variance in the 
predictive model. Let X^ be the design matrix that includes the selected 
variables at a given stage, and X5 = [X5 Zj] with Zj the new potential co- 
variate to be considered for inclusion. Without loss of generality, we suppose 
all variables have been standardized. Consider the following two models: 

(1) y = Xs/3s + Zj/3j+estep, s^tep^ N{Q,al^^^l), 

(2) rs = Zjjj + estage , estagc ~ A^(0, 0-,^tageI) > 

where rg = (I — X5(X^Xs)~-'^X^)y are the residuals of the projection of 
y on Xg. All known estimators of the parameters /3j,o"step and 7j,iTstage 
will provide different estimates when the covariates present some degree of 
multicollinearity, and, consequently, significance tests based on estimates of 
/3j or 7j do not necessarily lead to the same conclusions. While in stepwise 
regression the significance of /3j in model (1) is at the core of the selection 
procedure, in streamwise regression one estimates more conveniently 7^. Lin, 
Foster and Ungar (2011) show that, when LS are used to estimate, 7^ = p/3j 
where p = zj(l — X5(X^X5)~^X^)zj. They then compare T-y = jj/{p^''^a), 
with suitable estimates for p and a, to the standard normal distribution 
to decide whether or not Zj should be added to the current model. The 
procedure is called VIF regression since Marquardt (1970) called 1/p the 
VIF for Zj. 

2.2. A robust weighted slope estimator. Since the test statistic T^ is 
based on the following, (1) the LS estimator 7^, (2) p, in turn based on the 
design matrix X5 and Zj, and (3) the classical estimator of a, it is obviously 
very sensitive to outliers, a form of model deviation. An extreme response 
or a very badly placed design point can have a drastic effect on T^. The 
latter is then compared to the null distribution: the correct asymptotic dis- 
tribution under the hypothesis that the regression model holds. With model 
deviations, the null distribution is not valid and, hence, selection decisions 
(to add the covariate or not) are taken rather arbitrarily. We propose here 
to limit the influence of extreme observations by considering weighted LS 
estimators of the form 

(3) 3 = (x'"^X"')~^X"'^y"', 



with X"" = diag(wit;^)X and y"" = diag('i /tt;^)y. The weights w^ depend on 
the data and are such that extreme observations in the response and/or 
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in the design have a nil or hniited effect on the value of /3. Dupuis and 
Victoria-Feser (2011) propose Tukey's redescending biweight weights 

(4) Mrf,c) = H'y7) -') ^ ^^l^^l^^' 

lo, if \ri\ >c, 

where rj = (yj — :>c[(3)/a are standardized residuals that are computed in 
practice for chosen estimators of /3 and a (see below). The constant c controls 
the efficiency and the robustness of the estimator. Indeed, the most efficient 
estimator is the LS estimator, that is, (3) with all weights equal to one 
(i.e., c— >-oo), but it is very sensitive to (small) model deviations, while 
a less efficient but more robust estimator is obtained by downweighting 
observations that have a large influence on the estimator, that is, by setting 
c < cxD in (4). The value c = 4.685 corresponds to an efficiency level of 95% 
for the robust estimator compared to the LS estimator at the normal model 
and is the value used throughout the paper. 

We follow Dupuis and Victoria-Feser (2011) and use for the weights 
w^ = Wi{r^;c) in (3), where the residuals r^ = {i/i — :Kf/3^)/a^ and ct" = 
1.483 med \f^ — med(f^)| , the median absolute deviation (MAD) of the resid- 
uals fO = yi- xf3°- The slope estimates are 3° = [(XJf)^X^]^i(X^2)Ty^ 
with X^ = [1 ^/uHxXix ■■■ ^/m^Xip] and X.'^'^ = [1 wnxn ■■■ WipXip],i = 
l,...,n, with weights Wij, for all j = l,...,p, computed using (4) at the 

residuals r^ = {tn — /Sqj — Xij/3j)/aj, with aj = MAD(yj — /3oj — xijfij). The 
slope estimators f3i, . . . ,f3p and the intercept estimators /3oi , • • • , /3op are com- 
puted on the p marginal models y = /3oi + xif3i+ ei,. . . ,y = /3op + Xpf3p + Sp 
using a robust weighted estimator defined implicitly through 

n 

(5) '^Wi{ri;c)riXi=0. 

i=l 

Here we consider Ruber's weights given for the regression model by 

(6) Wi{ri;c) = min<^ 1; — 

with c = 1.345. Estimators in (5) belong to the class of M-estimators [Huber 
(1964, 1967)]. With (6) in (5), the marginal intercepts and slope estimators 
are simpler (and faster) to compute than the ones based on Tukey's biweight 
weights as originally proposed in Dupuis and Victoria-Feser (2011). For the 
scale in the weights in (5), we propose to use the MAD of the residuals. 

The estimator in (3) is a one-step estimator that is actually biased when 
there is multicollinearity in the covariates. Dupuis and Victoria-Feser (2011) 
show that the bias can be made smaller and even nil if /3 = /3^ is iterated fur- 
ther to get, say, /3^, computed at the updated weights wj,. . . ,w-~ based on 
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the residuals rf ^ = (y, - ^fp^^^)/a^^\. . .,rl''-^'^ = {yi - ^f^(^-^^)/a^^-^\ 
In the simulation study in Section 3, however, we find that the bias is very 
small even with relatively large multicollinearity, so that in practice there is 
often no need to proceed with this iterative correction. 

Finally, j3^ is a coordinate-wise robust estimator and Alqallaf et al. (2009) 
show, through the computation of a generalized version of the influence 
function [Hampel (1968, 1974)] and different contamination schemes in the 
multivariate normal (MVN) setting, that coordinate-wise robust estimators 
can be less sensitive to extreme observations when they occur independently 
at the univariate level. 

2.3. Robust VIF selection criterion. Let X^ = diag{i/ w'j^g)'Ks be the 

weighted design matrix at stage S with, say, q columns (hence q — 1 covari- 
ates), and zj = diag{^ Wij)zj the new candidate covariate that is evaluated 
at the current stage S + 1 . One could use the weights w^g for z"^ instead of 
the weights Wij computed at the marginal models with only Zj as a covariate, 
but this would require more computational time. The simulation results in 
Section 3 show that one gets very satisfactory results with Wij . Let also Xg = 
[X^lzJ'] and define Pf as the last element of the vector [X^^X^J-^X^'^y"' 

with y"' = diag(A /ii;^g)y. j3j^ is actually a robust estimator of (3j in (1). Let 

H^ = X^(X^^X^)-iX^^ and 3^ = (X^^X^)-iX^'^y"', then 

- (Zj- Zj- -Zj H^Zj-j Zj (y -}(^s[3s) 
= {zj^zj - zJ^U'^zjy^zJ^r'S 

= {zj'^zj - zJ^U'^zJ)~\zJ^zJ){zj'^zJ)-\j^r'§, 
where r^ are the residuals of the weighted fit of y"' on X^. Let 

p"- = (zj'^z^"')-^(z^"'^zj' - zJ^H^zJ), 
then 

with 7^ = {z'^'^zj)~^z'^'^r'g, that is, the weighted estimator of the fit of zJ 

on the weighted residuals r^, that is, model (2). Note, however, that /3j" is 

not equal to the last element of /^^.^i unless the weights w^g are used for 
z^. Note also that we can write 

p - 1 - Kjs , 
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with 

(7) Rjs = Zj tlsZj {zj zj ) 

a robust estimate of the coefficient of determination R^ . Renaud and Victoria- 
Feser (2010) propose a robust R^ based on weighted responses and covariates 
and (7) is equivalent to their proposal (with a = 1, see their Theorem 1) but 
with other sets of weights. Moreover, p^ is the partial variance of z^ given 



X 



g [See Dupuis and Victoria- Feser (2011)]. 



Lin, Foster and Ungar (2011) note that using all the data to compute p (in 
the classical setting) is quite computationally expensive and they propose 
a subsampling approach. For the same reason, we also propose to actually 
estimate p^ by computing (7) on a randomly chosen subset of size m = 200. 

To derive the t-statistic based on 7^, we follow Lin, Foster and Ungar 
(2011) who base their comparison on the expected value of the estimated 

variance of, respectively, /3j' and 7J'. Let dgi^p ^^d dgtag^ be, respectively, 
robust residual variance estimates for models (1) and (2). Let also A.u)ij) 

denote the element (i, j) of matrix A. For /3"', supposing that Wij/w^ ^ 1, 
we can use 



Yar{^f 



a: 



step 
'stepV 



[X^^X' 



cr, 



wT w 
-■3 ^3 

-1/ 



-1 
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tis^j ) 

-1-1 
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with 



(8) 



6 



+ 1 d<^{r) 



1 d<^{r) 



and $ the standard normal cumulative distribution [see Heritier et al. (2009), 
equation (3.20)]. For 7^, based on the model with r^ as the response and 
zf as the explanatory variable (without intercept), we have 



Var(7]" 



(T„. 



A^ 



-1^-1 



stage V^j ^j 



cr. 



stage 

n 






n 



{4 



with e~^ the efficiency of a robust slope estimator computed using Ruber's 
weights relative to the LS, which is not equal to e~^, the efficiency of a 
robust slope estimator computed using Tukey's weights relative to the LS. 
We will see below that the computation of the former is not needed. Hence, 
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approximating (Tg^gp ss (Tg-i-age = <5' , we have 



Var(/3"') « (p"')-V^K^)(e,/g,)-^ 



An honest approximate robust test statistic Tyj is then given by 



\/Var(/3-) ^(p-)-iv;^)(e,/e,)-i' 



7r 



-c 



that is, 

(9) T^ = (p-)-i/2, 

/<TVn(l/nE.^.f)-^ee 

with cj^ a robust mean squared error for the model with r^ as response 
and zj as explanatory variable [i.e., model (2)]. We use a = MAD(r5! — 

Our fast robust evaluation procedure is summarized by the following five 
steps. Suppose that we are at stage S and a set of g — 1 covariates has been 
chosen in the model. We are considering covariate Zj for possible entry. 
We are working with c = 4.685 and have computed Cc and the weights Wij 
and w^gi 

(1) Obtain the residuals r^ = y"' - X^(X^^X^)-iX^^y«'. 

(2) Set z7 = dmg{^/w-)zj. Compute ^f = {zY^zJ)-^z':"^r'^ and a = 



MAD(r^ - zjizj^z^y^zf^r"^). 

(3) Sample a small subset X = {^i, . . . ,im} G {1, . ■ . ,n} of the observations 
and let jx denote the corresponding subsample from the regressor x. 



(4) Let iH^ = xX^(jX^^xX^)-iiX^^, compute RJ^ = xzJ^iU'^xz 
'.J^xzJ)-\ and find p'" = l- RJ^ 



X 



(izj^izp-i, and find p"^ = 1- R""^ 

(5) Compute the approximate t-ratio T^, = {p'^y^^'^jf / J^HTd^^fY^ 
and compare it to an adapted quantile to decide whether or not to add Zj 
to the current set. 

A more detailed algorithm in which the decision rule (whether or not to 
add the new variable) is also specified is given in the Appendix. Note that 
in Step 5 above, the rejection quantile, or corresponding probability Uj, 
is adapted at each step j so that Uj increases/decreases if a rejection is 
made/not made. As explained in Lin, Foster and Ungar (2011), one can 
think of aj as a gambler's wealth and the game is over when aj < 0. 

2.4. Comparison with the robust t-statistic of FRFS. The i-statistic pro- 
posed by Dupuis and Victoria-Feser (2011) [equation (5)] and used to test 
whether a candidate covariate is entered in the current model can be written 
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with yj' = diag(^t(;jj)y. Supposing that y'f ss y"' and nj^Wij ?» 1, then 

\2 ,,t«r„ 



(tD' y""z^ 



r«) 



Hence, T^ in (9) and T^ in (10) differ by a multipHcative factor of 



wT w 



which is the ratio of the robustly estimated covariance between Zj and y, 
and the robustly estimated partial covariance between Zj and y given X5. 
One can notice that in the orthogonal case (and standardized covariates), 
we have z^ H^ ~ so that k ~ 1. The value of k was computed in some 
of the simulations outlined in the following section. While k maintained 
a median value of 1 when aggregating over the 200 simulated data sets 
at a given setting, its variability changed with the theoretical R^ and the 
absence or presence of outliers. For example, the interquartile range went 
from a value near for R^ = 0.20 and no outliers, to 5 for R^ = 0.80 and 5% 
outlying responses with high leverage in the p = 100 case. There can thus 
be a considerable difference in the two test statistics. 

3. Simulation study. We carry out a simulation study to assess the effec- 
tiveness of the model selection approaches outlined above. First, we create 
a linear model 

(11) y = Xi + X2 + --- + Xk + (Je, 

where Xi,X2,...,Xfc are multivariate normal (MVN) with E(Xj) = 0, 
Var(Xj) = 1, and coii{Xi,Xj) = 9, i / j, i, j = 1, . . . , fc, and e an indepen- 
dent standard normal variable. We choose 9 to produce a range of theoret- 
ical R^ = (Var(y) — (T^)/Var(y) values for (11) and a to give t values for 
our target regressors of about 6 under normality as in Ronchetti, Field and 
Blanchard (1997). The covariates Xi, . . . ^Xj. are our k target covariates. Let 
efe_|_i, . . . , Bp be independent standard normal variables and use the first 2k 
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to give the 2k covariates 

Xk+i = Xi + Xek+i, Xk+2 = Xi + Xek+2, 

Xk+3 = X2 + Xek+3, Xk+4 = ^2 + ^ek+4, 

^sk-i = ^k + ^e^k-i, X^k = Xk + Xesk', 

and the final p — 3k to give the p — 3k covariates 

Xi = ei, i = 3k + 1,. . . ,p. 

Variables X/^^i,. . . jX-^i^ are noise covariates that are correlated with our 
target covariates, and variables X^^+i, ■ ■ ■ ,Xp are independent noise covari- 
ates. Note that the covariates Xi, . . . ,Xp are then relabeled with a random 
permutation of 1 : p so that the target covariates do not appear in position 
1 : k, but rather in arbitrary positions. This is necessary to test the effec- 
tiveness of the streamwise variable selection, as covariates considered early 
on are favored for entry when many covariates are correlated. 

We consider samples without and with contamination. Samples with no 
contamination are generated using e ~ N(0, 1). To allow for 5% outliers, we 
generate using e ~ 95%N(0, 1) + 5%N(30, 1). These contaminated cases also 
have high leverage X-values: Xi, . . . , Xj. ~ MVN as before, except Var(Xj) = 
5, i = 1, . . . , /c. This represents the most difficult contamination scheme: large 
residuals at high leverage points. We also investigate the less challenging 
cases of 5% outlying in response only and 5% high leverage only. We choose 
A = 3.18 so that corr(Xi,Xfe+i) = corr(Xi,Xfc+2) = corr(X2,Xfc+3) = • • • = 
corr(Xfe,X3fc)=0.3. 

In all simulations we simulated n independent samples, with or with- 
out contamination, to use for variable selection. Then, another n indepen- 
dent samples without contamination were simulated for out-of-sample per- 
formance testing. The out-of-sample performance was evaluated using the 
mean sum of squared errors (MSE), J2i=n+i(yi ~ ^I f^)"^ /''^^ where /3 is the es- 
timated coefficient determined by the classical and robust VIF regression se- 
lection procedures or FRFS applied to the training set. Because the true pre- 
dictors are known, we also compute the out-of-sample performance measure 
using the true /3. Classical VIF selection was carried out using the VIF pack- 
age for R and default argument settings. Robust VIF was also implemented 
in R and code is available at http://neumann.hec.ca/pages/debbie.dupuis/ 
publicVIFfncs.R. FRFS is also implemented in R as outlined in Dupuis and 
Victoria-Feser (2011). 

It should be noted that when evaluating the performance of a given crite- 
rion (here a selection procedure), the evaluation measure should be chosen 
in accordance with the performance measure [see Gneiting (2011)]. In our 
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case, although the data are generated from contaminated conditional Gaus- 
sian models, the core model is still Gaussian and we wish to find the model 
that best predicts the conditional mean response. Consequently, a suitable 
performance measure is the expected squared error. However, when esti- 
mating the expected squared error from data, one can resort to the mean 
(i.e., the MSE) only if the data are purely issued from the postulated (core) 
Gaussian model. If this is not the case, or if there is no guarantee that this is 
the case, like, for example, with real data, then a more robust performance 
measure such as the median absolute prediction error (MAPE) should be 
chosen. Hence, in the simulations we use the MSE, while with real data sets 
we use the MAPE to estimate the evaluation measure for the comparison of 
the variable selection methods. 

Simulations results for n = 1000, k = 5, and p = 100 and p = 1000, are 
presented in Table 1 and Figures 1 and 2, respectively. Entries in the top 
panel of the table give the percentage of runs falling into each category. 
The category "Correct" means that the correct model was chosen. "Extra" 
means that a model was chosen for which the true model is a proper sub- 
set. "Missing 1" means that the model chosen differed from the true model 
only in that it was missing one of the target covariates; "Missing 2" and 
"Missing 3" are defined analogously. The Monte Carlo standard deviation 
of entries is bounded by 3.5%. We also report the empirical marginal false 
discovery rate (mFDR) mFUR = 'E{V)/(E{V) + £(8) + ri), where E{S) is 

the average number of true discoveries, E{V) is the average number of false 
discoveries, and rj = 10 is selected following Lin, Foster and Ungar (2011). 
We also report the required computation time. Note the particularly frugal 
robust approach to VIF regression: the cost of robustness is no more than a 
doubling of the computation time. 

Both VIF algorithms do not perform well in terms of the proportion of 
correctly selected models and the FRFS-Marginal procedure clearly outper- 
forms in this respect. The execution time of the FRFS-Marginal procedure, 
the faster of the two FRFS approaches presented in Dupuis and Victoria- 
Feser (2011), is roughly 25 times longer than that of the robust VIF proce- 
dure for these sizes of data sets. Both VIF algorithms do, however, choose a 
model for which the true model is a subset when there are no outliers. The 
classical VIF approach fails miserably in the presence of outliers (outlying 
response/high leverage), while the robust VIF approach is only slightly af- 
fected by the presence of outliers. The classical VIF approach is less affected 
by the presence of high leverage points only, but the effect is increased under 
more highly correlated regressors or a higher number of potential regres- 
sors. Results (not shown) for response variable only outliers are very similar 
to outlying response/high leverage outliers. Finally, other simulations (not 
shown) reveal that for less outlying contamination, the robust approaches 



Table 1 

Model selection results. Simulated data, as described in Section 3, have n — 1000 observations with p = 100 and p = 1000 potential 

regressors, including fc = 5 target regressors. Correlation among target regressors is 9 — 0.1 {R^ = 0.20) and 6 = 0.85 (i?^ — 0.80). 

Correlation among each target regressor and two other regressors is 0.3 m all cases. Remaining regressors are uncorrelated. Methods are 

classical (C) and robust (R) VIF regression, and FRFS-Marginal (F). Table entries are % of cases m categories listed m the first 

column. Empirical mFDR appears m the second to last row. Mean execution times (in seconds) appear in the last row. Data were either 

not contaminated, had 5% high leverage only (hi only), or 5% outliers (outlying response and high leverage). Results are based on 200 

simulations for each configuration 
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Fig. 1. Out-of-sample mean square errors of the models chosen by classical and ro- 
bust VIF regression, and FRF S- Marginal. Simulated data, as described in Section 3, have 
n = 1000 observations with p = 100 potential regressors, including k — 5 target regressors. 
Correlation among target regressors is 9 = 0.1 {R^ = 0.20) and 6 — 0.85 {R^ = 0.80). Cor- 
relation among each target regressor and two other regressors is 0. 3 in all cases. Remaining 
regressors are uncorrelated. Results are based on 200 simulations for each configuration. 
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Fig. 2. Out-of-sample mean square errors of the models chosen by classical and ro- 
bust VIF regression, and FRF S- Marginal. Simulated data, as described in Section 3, have 
n = 1000 observations with p — 1000 potential regressors, including k = 5 target repressors. 
Correlation among target regressors is 9 — 0.1 {R^ = 0.20) and 6 = 0.85 {R^ = 0.80). Cor- 
relation among each target regressor and two other regressors is 0. 3 in all cases. Remaining 
regressors are uncorrelated. Results are based on 200 simulations for each configuration. 
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always maintain good performance, while the negative impact on classical 
VIF is proportional to the level of outlyingness. 

As the simulated data sets have noise covariates that are correlated with 
target covariates, the poor performance in terms of %Correct is expected 
given the streamwise approach of VIF regressions. But as pointed out by 
Lin, Foster and Ungar (2011), the goal here is different: good fast out-of- 
sample prediction, that is, one sacrifices parsimony for speed. The stream- 
wise approach is fast and the main purpose of an a-investing control is to 
avoid model overfitting. We assess the latter through out-of-sample perfor- 
mance. Figure 1 shows out-of-sample MSE for the case p = 100. Robust 
VIF is as efficient as classical VIF when there are no outliers (top panel) 
and clearly outperforms classical VIF when there is 5% contaminated ob- 
servations (bottom panels). Robust VIF also loses very little with respect 
to FRFS-Marginal. Note that classical VIF seems to offer some resistance 
to contamination by high-leverage points only (as was also seen in Table 1), 
but completely falls apart in the presence of outlying response values, and 
this whether the outlying responses appear at high-leverage points or not. 
Much of the same can be seen in Figure 2 where results for the case p = 1000 
are shown. 

4. College data. Understanding the factors impacting an individual's 
educational attainment is a preoccupation for many governmental and non- 
governmental organizations. For example, a nation's government that recog- 
nizes the potential economic benefits of higher education will seek to write 
public policies to promote it. Private industry that benefits from a well- 
educated labor market will let it affect decision making, for example, a 
company may choose to establish itself where lifelong education is easily 
accessible to its personnel. Finally, an individual's family who associates 
personal achievement with higher levels of education may also act accord- 
ingly. 

Since the first work by Wetterlind (1976) on projecting community college 
enrollments in Arizona, many researchers have sought to identify the factors 
impacting educational attainment; see, for example, Pennington, McGinty 
and Williams (2002), Petrongolo and San Segundo (2002), Kienzl, Alfonso 
and Melguizo (2007), and Clark (2011) (and references therein) for a list of 
various studies. 

The data analyzed here are in the R package AER and are a subset of the 
data previously analyzed in Rouse (1995). There are 4739 observations on 14 
variables. The variables are listed in Table 2. We seek to predict the number 
of years of education using 13 economic and demographic variables. There 
are continuous and binary variables along with one categorical variable with 
three categories which is converted to two dummy variables. When consid- 
ering only first-order variables we thus have n = 4739 and p = 14; when 
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Table 2 
Original I4 variables in college data 

Variable Description 

gender Factor indicating gender. 

ethnicity Factor indicating etiinicity (African-American, Hispanic or other). 

score Base year composite test score. Tliese are aciiievement tests given 

to iiigii sciiool seniors in the sample, 

f college Factor. Is the father a college graduate? 

mcollege Factor. Is the mother a college graduate? 

home Factor. Does the family own their home? 

urban Factor. Is the school in an urban area? 

unemp County unemployment rate in 1980. 

wage State hourly wage in manufacturing in 1980. 

distance Distance from 4- year college (in 10 miles), 

tuition Average state 4- year college tuition (in 1000 USD), 

income Factor. Is the family income above USD 25,000 per year? 

region Factor indicating region (West or other), 

education Number of years of education. 



we include second-order interaction terms p rises to 104 (some interaction 
terms are constant and are removed). We have standardized the variables. 
Our analysis will show how classical, that is, nonrobust, VIF regression 
can be inadequate for the policy maker by failing to keep important fea- 
tures. 

The selected models are compared using the median absolute prediction 
error (MAPE), as measured by 10-fold CV. That is, we split the data into 
10 roughly equal-sized parts. For the A;th part, we carry out model selec- 
tion using the other nine parts of the data and calculate the MAPE of 
the chosen model when predicting the A;th part of the data. We do this 
for k = 1, . . . , 10 and show boxplots of the 10 estimates of the MAPE. For 
all methods, the data were split in the same way. For the college data, we 
randomly generated the folds. Note here that we look at MAPE instead of 
mean squared prediction error, as these real data can contain outliers (as 
opposed to the simulated data which were clean) and the MAPE is a better 
choice. 

For completeness, we compare the models selected by classical and robust 
VIF approaches with those of FRFS-Marginal and FRFS-Full where feasi- 
ble, as well as that of the popular least angle regression (LARS) of Efron 
et al. (2004), an extremely efficient algorithm for computing the entire Lasso 
[Tibshirani (1996)] path. We use the R package lars to do the computations. 

Tables 3 and 4 list the VIF and robust VIF regression selected features, 
along with estimated slopes, for the p = 14 and p = 104 scenarios, respec- 
tively. For both scenarios, the robust VIF regression approach selects slightly 
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Table 3 

VIF and robust VIF selected variables and estimated slope parameters (t-values) 

when only considering first-order terms. FRFS-Marginal and 

FRFS-Full selected variables and estimated slope parameters (t-values) are also shown. 

Significance: * 0.05, "0.01, *" 0.001 





VIF 


robVIF 


FRFS-Marg/FulI 


Variable 


^LS 


/3rob 


/3rob 


ethnicityaf am 


0.130 (5.28)*" 


0.129 (4.90)*** 


0.133 (5.16)*** 


ethnicityhispanic 


0.142 (5.97)"* 


0.124 (4.92)*** 


0.130 (5.19)*** 


score 


0.772 (31.3)*** 


0.820 (31.8)*** 


0.824 (31.9)*** 


fcollegeyes 


0.219 (8.40)*** 


0.233 (8.51)*** 


0.232 (8.52)*** 


mcollegeyes 


0.131 (5.25)*** 


0.146 (5.60)*** 


0.145 (5.55)*** 


homeyes 


0.054 (2.39)* 


0.057 (2.38)* 


0.057 (2.41)* 


urbanyes 


- 


0.024 (0.96) 


- 


unemp 


- 


0.077 (3.00)** 


0.075 (2.94)** 


wage 


- 


-0.064 (-2.56)* 


-0.062 (-2.50)* 


distance 


-0.064 (-2.81)** 


-0.083 (-3.22)*** 


-0.088 (-3.57)*** 


incomehigh 


0.163 (6.70)*** 


0.180 (7.07)*** 


0.183 (7.20)*** 


genderf emale 


- 


- 


0.066 (2.81)** 



Table 4 

VIF and robust VIF selected variables and estimated slope parameters (t-values) when 

including second-order interactions. Significance: * 0.05, "0.01, *** 0.001 
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/3l 



/3r 



ethnicityaf am 

ethnicityhispanic 

score 

fcollegeyes 

mcollegeyes 

homeyes 

urbanyes 

unemp 

wage 

distance 

incomehigh 

genderf emale : score 

genderf emale : fcollegeyes 

genderf emale : mcollegeyes 

score : incomehigh 

fcollegeyes : homeyes 

fcollegeyes: unemp 

fcollegeyes: wage 

fcollegeyes : tuition 

mcollegeyes : wage 



0.132 (5.39)*** 
-0.143 (6.02)*** 
0.772 (31.1)*** 
0.222 (8.52)*** 
0.056 (1.62) 
0.056 (2.46)* 



-0.062 (-2.75)* 
0.167 (6.87)*** 
0.030 (1.24) 
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0.06) 

3.43)*** 
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more, and/or slightly different, features. When considering only first-order 
terms, we see that the classical and robust estimates of commonly selected 
features are almost the same. This serves as a good form of validation for 
the relative importance of these features. However, the presence of outliers 
in the data has led classical VIF regression to completely miss two impor- 
tant features which are identified by robust VIF regression: unemp and wage. 
Even LS estimates (not shown) of the robust VIF regression selected model 
find these two features important with t-values of 3.15 and —2.70, but the 
classical VIF regression selection procedure could not detect this importance 
for the reasons outlined in the Introduction. FRFS-Marginal and FRFS-Full 
selected features are identical. The latter features, along with estimated 
slopes, are also shown in Table 3. 

VIF regression also misses the two important features in the p = 104 
scenario; see Table 4. As both the county unemployment rate and the state 
hourly wage in manufacturing are directly impacted by economic policy, pol- 
icy makers must be equipped with the best feature selection tools to have an 
effective strategy to reach sought after goals: in this case, increasing the level 
of education among its constituents. These tools, we argue, must include a 
robust selection procedure, as shown effectively by this example. Further 
evidence is given in Figure 3 where MAPE for VIF, robust VIF, FRFS- 



first-order only 



with interactions 



robVIF FRFS-M FRFS-F lasSO 



robVIF FRFS-M FRFS-F lassO 



Fig. 3. College data: Out-of-sample median absolute prediction errors of the models cho- 
sen by classical and robust VIF regression, FRFS-Margmal, FRFS-Full and the Lasso, in 
10-fold cross-validation. 
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Table 5 

Number of variables selected by VIF and robust VIF in 100 analyses of college data, each 

analysis with covariates presented in a random order 



# selected 


7 


8 


9 


10 


11 


12 


13 


14 


VIF 
robVIF 


11 

4 


29 
13 


24 

8 


22 
23 


10 

17 


3 

12 


1 
10 


13 



Marginal, and FRFS-Full and Lasso are shown for both scenarios. Robust 
VIF outperforms both of its nonrobust competitors, and even does better 
than FRFS-Marginal in the highly collinear case including interactions. It 
was shown in Dupuis and Victoria- Feser (2011) that FRFS-Marginal could 
select too few features in the highly collinear case and this motivated the 
development of FRFS-Full therein. 

As the solution for VIF and robust VIF regression can depend on the 
order of the covariates, we ran each procedure several times with the covari- 
ates presented in random order to investigate the stability of the selected 
models in terms of model size and prediction performance. Table 5 shows 
the distribution of the size of the selected model over 100 analyses and Ta- 
ble 6 shows how often each variable was selected over these 100 analyses. As 
expected, there is considerable variability in the size of the model, and this 
both in the classical and robust approaches. We see, however, that the dom- 
inating features are nearly always present. Note also that unemp and wage 
are selected twice as often in the robust approach compared to the classical 

Table 6 

Number of analyses where variable was selected by VIF and robust VIF in 100 analyses 

of college data, each analysis with covariates presented in a random order 

Variable VIF robVIF 

genderf emale 43 47 

ethnicityafam 100 100 

ethnicityhispanic 67 73 

score 100 100 

fcollegeyes 100 99 

mcollegeyes 100 100 

homey es 79 94 

urbanyes 3 38 

unemp 24 54 

wage 31 63 

distance 100 98 

tuition 26 56 

incomehigh 100 98 

regionwest 31 57 
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Fig. 4. College data: Out-of-sample MAPE of 10 random chosen analyses among 100 
analyses reported in Tables 5 and 6 for classical and robust VIF regression, MAPE calcu- 
lated based on 10-fold cross-validation. 

approach. In terms of prediction performance, we see in Figure 4 that the 
variabihty in the latter is considerably less, each of the 10 random analyses 
shown yielding more or less the same prediction performance despite the 
differences in terms of selected model size and features. 

5. Crime data. In this section we present a shorter analysis of another 
data set to show how the classical approach can even fail to give a usable 
result. Also, by looking at a considerably larger data set we can show how 
robust VIF provides robust prediction where no other robust method is 
feasible. 

We analyze recently made available crime data. These data are from the 
UCI Machine Learning Repository [Frank and Asuncion (2010)] and are 
available at http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime. 
We seek to predict the per capita violent crimes rate using economic, de- 
mographic, community, and law enforcement related variables. After remov- 
ing variables with missing data, we are left with n = 1994 observations on 
p = 97 first-order covariates. If we include second-order interactions (remov- 
ing those that are constant), we have p = 4753. In both cases, we stan- 
dardized the variables. VIF regression selects 33 and 1437 variables, in the 
respective scenarios, while robust VIF regression selects 20 variables in both 
cases. Classical VIF experiences problems with the larger data set, which 
contains outliers in a highly multicollinear setting, and chooses too many 
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first-order only with interactions 



Fig. 5. Crime and communities data: Out-of-sample median absolute prediction errors 
of the models chosen by classical and robust VIF regression, and the Lasso, in 10-fold 
cross-validation. * Results are not shown as VIF collapses m 4 folds, yielding MAPE of 
5.62, 6.55, 6.82, 9.41, and 15.1, respectively. Results for other folds were good, 0.0652, 
0.0676, 0.0686, 0.0694, 0.0744; but are excluded from the boxplot to allow for a better 
comparisons of all methods. 



covariates. This shows how the guarantee of no overfitting only holds at the 
model, that is, without any outliers in the data. For these data, robust VIF 
regression provides the only viable option for policy makers, as the 1437 fea- 
tures returned by classical VIF regression do not provide useful information. 
As can be seen in Figure 5, robust VIF is clearly the best performer for both 
scenarios. VIF regression chooses too many features for many of the folds 
and this leads to catastrophic results out-of-sample. 

6. Concluding remarks. In Lin, Foster and Ungar (2011) it was also 
shown that classical VIF regression equates or outperforms stepwise regres- 
sion, Lasso, FoBa, an adaptive forward-backward greedy algorithm focusing 
on linear models [Zhang (2009)], and GPS, the generalized path-seeking al- 
gorithm of Friedman (2008). In this paper we present a very efficient robust 
VIF approach that clearly outperforms classical VIF in the case of contami- 
nated data sets. This robust implementation comes with a very small cost in 
speed, computation time is less than doubled, and provides a much-needed 
robust model selection for large data sets. 
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APPENDIX: ALGORITHM ROBUST VIF REGRESSION 

The robust VIF regression procedure, based on a streamwise regression 
approach and a-investing, can be summarized by the following algorithm: 

Input data y,xi,X2, . . . (standardized) 

Set initial wealth ao = 0.50, pay-out Aa = 0.05, subsample size m, and ro- 
bustness constant c 

Compute efficiency e~^ where Cc is as in (8) 

Get all marginal weights Wij by fitting p marginal models y = /3oi + a^i/9i + 
ei , . . . , y = /3op + XpPp + Sp using (5) and (6) 

Initialize j = l,S = {0}, Xs = 1, X- = diag(y^)X5 and y- = diag(y^)y 

where w'^g is computed using (4) where r*^ = (y — 1(3^) /a^ using Xq' = 
X^2 = 1, 30 = [(X^)^X^]-i(X^2)Ty^^^gj.g^o ^ 1.483 med|fO-med(f 0)1 

andfO = y-l3°. 
repeat 

set aj = aj/{l+j - f) 

get Tu, from the five-step Fast Robust Evaluation Procedure in Section 2.3. 

if 2{1 - <^{\Ty,\)) < aj then 

S = SU{j}, Xs = [l X,], 



where w'^g is computed using (4) where r^ = [y — 'Ksi3^)/a^ using X^ = 
[1 ,/wUxij],Xf=[l «;i,-x,,-],i = l,...,n,3° = [(X^)^X-]-i(Xf)^y, 

where a^ = 1.483 med \r^ — med(rO)| and r^ = y — X^/?". 
flj+i = aj + Aa 

f = j 
else fflj+i = Oj — aj/{l — aj) 
end if 

until all p covariates have been considered. 
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